Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Branch detection updates #667

Merged
merged 10 commits into from
Jan 7, 2025

Conversation

JelmerBot
Copy link
Contributor

This PR solves #660, adding labels and probability parameters to BranchDetector.fit() that override the input HDBSCAN object. Cases where overridden clusters form multiple connected components in the minimum spanning tree are detected. The component labels are returned as branch labels in that case. The condensed and linkage trees for those clusters are set to None, allowing scripts to detect what happened.

While working on this PR, I noticed that the branching code could be simplified extensively. This PR revert some of the changes I made when I introduced the branch detection code. I also found and fixed some issues with the hierarchy simplification code that applies a persistence threshold and added a persistence threshold parameter to the clustering code.

Finally, I made small changes in _hdbscan_boruvka.pyx to expose the computed core distances and neighbours. This allows me to use the implementation in another project.

Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

Comment on lines +6 to +10
class BranchDetectionData(object):
"""Input data for branch detection functionality.

Recreates and caches internal data structures from the clustering stage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved BranchDetectionData to a new file so branches.py can import hdbscan_.py without cyclical imports.

Comment on lines -1218 to +1245
self._finite_index = get_finite_row_indices(X)
clean_data = X[self._finite_index]
finite_index = get_finite_row_indices(X)
clean_data = X[finite_index]
internal_to_raw = {
x: y for x, y in zip(range(len(self._finite_index)), self._finite_index)
x: y for x, y in zip(range(len(finite_index)), finite_index)
}
outliers = list(set(range(X.shape[0])) - set(self._finite_index))
outliers = list(set(range(X.shape[0])) - set(finite_index))
Copy link
Contributor Author

@JelmerBot JelmerBot Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now recover the finite index from the condensed tree, so finite_index does not have to be stored explicitly anymore. This reverts changes I made when I introduced the branch detection code.

assert_array_almost_equal,
assert_raises,
assert_array_almost_equal
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

assert_raises gives import error on CI/CD. I replaced it with pytest.raises in all tests.

@lmcinnes
Copy link
Collaborator

This looks great. Let me know when you are ready to have it merged.

@JelmerBot JelmerBot marked this pull request as ready for review January 3, 2025 11:24
@JelmerBot
Copy link
Contributor Author

I think this is ready now. It contains breaking name changes for the BranchDetector, but that makes the naming consistent across repositories and these names make more sense to me.

@lmcinnes lmcinnes merged commit 442c3a8 into scikit-learn-contrib:master Jan 7, 2025
1 check passed
@lmcinnes
Copy link
Collaborator

lmcinnes commented Jan 7, 2025

Thanks for all your work on this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants